Linguistic Problems Based on Text Corpora

Authors

  • Boris Iomdin
  • Alexander Piperski
  • Anton Somin

Abstract

This paper focuses on self-contained linguistic problems based on text corpora. We argue that corpus-based problems differ from traditional linguistic problems because they make it possible to represent language variation. Furthermore, they often require basic statistical thinking from students. We show the practical value of using data obtained from text corpora for teaching linguistics through linguistic problems.

Similar resources

Process Model for Composing High-quality Text Corpora

The Teko corpus composing model offers a decentralized, dynamic way of collecting high-quality text corpora for linguistic research. The resulting corpus consists of independent text sets. The sets are composed in cooperation with linguistic research projects, so each of them responds to a specific research need. The corpora are morphologically annotated and XML-based, with in-built compatibilt...

Interoperability of Corpora and Annotations

This paper describes the application of OWL and RDF to address the interoperability of linguistic corpora and linguistic annotations within such corpora. Interoperability of linguistic corpora involves two aspects: Structural interoperability (annotations of different origin are represented using the same formalism) and conceptual interoperability (annotations of different origin are linked to ...

Discontinuous Constituents: a Problematic Case for Parallel Corpora Annotation and Querying

In this paper, we discuss some linguistic phenomena that pose potential problems for multilevel linguistic annotation of parallel corpora in general and specifically for data encoding with state-of-the-art multilevel corpus querying tools such as CQP. We describe the strategy we use for integrating the standard hierarchical XML representation used to annotate such phenomena in our aligned bilingual...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Beyond its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text

This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos. Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. We evaluate our approach on a collection of YouTube videos as well as two l...

Publication date: 2013